Implement fallback to WireServer when deploying SSH keys#3603
Conversation
There was a problem hiding this comment.
Pull request overview
Adds back the ability for the default (waagent) provisioning flow to retrieve SSH key material from the WireServer Certificates package when ovf-env.xml does not contain the full public key value, aligning behavior with pre-#3484 provisioning.
Changes:
- Invoke a new
_download_ssh_keys()step during user account configuration to fetch the Certificates goal state property when needed for SSH key deployment. - Add unit tests intended to verify when the Certificates package is (and isn’t) fetched.
- Update existing ovf-env test fixtures to remove
<KeyPairs>so unrelated provisioning tests don’t implicitly require certificate downloads.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
azurelinuxagent/pa/provision/default.py |
Adds conditional WireServer certificates download prior to deploying SSH keys. |
tests/pa/test_provision.py |
Adds tests for the new certificate-download decision logic using mock_wire_protocol. |
tests/data/ovf-env.xml |
Removes <KeyPairs> from baseline fixture to avoid triggering certificate download during unrelated tests. |
tests/data/ovf-env-2.xml |
Same as above for the “pga_true” fixture. |
tests/data/ovf-env-4.xml |
Same as above for the “pga_bad” fixture. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.deploy_ssh_keypairs(ovfenv) | ||
|
|
||
| def _download_ssh_keys(self, ovfenv): | ||
| # |
There was a problem hiding this comment.
For the reason of this check, see the implementation of self.deploy_ssh_pubkeys() and self.deploy_ssh_keypairs(). In summary, the former will look for /var/lib/waagent/*.crt when ovfenv.xml doesn't have a value for the key, and the latter will look for /var/lib/waagent/*.prv when ovfenv.xml contains any key pairs.
For the format of ovfenv.xml, see the unit tests.
| if not download_certificates: | ||
| return | ||
|
|
||
| # |
There was a problem hiding this comment.
Originally, the PA would retrieve the goal state as part of protocol detection, which retries 360 times, with a 10 second period. If that fails, the process exits and is restarted by the system, which goes again into the 360 retries, etc.
I feel just a few tries are enough, let me know if you think otherwise; thanks.
There was a problem hiding this comment.
I checked that cloud-init also retries for a long period of time, so I decided to use the same strategy as protocol detection
| <Value>ssh-rsa AAAANOTAREALKEY== foo@bar.local</Value> | ||
| </PublicKey> | ||
| </PublicKeys> | ||
| <KeyPairs> |
There was a problem hiding this comment.
The key pairs in the unit tests are not used, and with the new code they would trigger the Certificates download. To fix this, quite a few tests need to be rewritten to mock that download. Since the keys are not used by the tests, I decided to remove them instead.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #3603 +/- ##
===========================================
+ Coverage 71.97% 73.46% +1.48%
===========================================
Files 103 122 +19
Lines 15692 18608 +2916
Branches 2486 2488 +2
===========================================
+ Hits 11295 13670 +2375
- Misses 3881 4341 +460
- Partials 516 597 +81 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| self.deploy_ssh_pubkeys(ovfenv) | ||
| self.deploy_ssh_keypairs(ovfenv) | ||
|
|
||
| def _download_ssh_keys(self, ovfenv): |
There was a problem hiding this comment.
since its not download always, can we rename to _download_ssh_keys_if_needed or something like that?
| <SSH> | ||
| <PublicKeys> | ||
| <PublicKey> | ||
| <Fingerprint>EB0C0AB4B2D5FC35F2F0658D19F44C8283E2DD62</Fingerprint> |
There was a problem hiding this comment.
Is it expected that the customer should only provide one of fingerprint and value, or is it acceptable to provide both?
There was a problem hiding this comment.
The original sample file from which I copied this one includes both. In practice, I believe only 1 would be populated.
Now, the logic in OSUtil.deploy_ssh_pubkey() prioritizes the value over the thumbprint, so in terms of the unit test that uses this new file, having both is a good test case.
| with TestProvision._create_provision_handler_with_mock_protocol() as (handler, protocol): | ||
| protocol.set_http_handlers(http_get_handler=mock_http_get) | ||
| with patch('azurelinuxagent.pa.provision.default.PROBE_INTERVAL', 0): # set the delay between retries to 0 | ||
| handler._download_ssh_keys_if_needed(ovfenv) | ||
| self.assertEqual(2 * MAX_RETRY, mock_http_get.call_count, "Expected maximum number of retries ({0}) to have been attempted".format(MAX_RETRY)) # times 2 since two cyphers are attempted for FIPS support | ||
| ssh_key_path = os.path.join(conf.get_lib_dir(), '8979F1AC8C4215827BF3B5A403E6137B504D02A4.crt') |
There was a problem hiding this comment.
won't fix, test takes .3 secs to execute and want to avoid excessive use of mocks
| # A failure to download the certificates won't necessarily produce an error. Certificates do not change often and they may have | ||
| # already been saved to disk on a previous goal state. We simply report the error and continue processing the goal_state; later on, | ||
| # already been saved to disk on a previous goal state. Re-raise the exception only if explicitly requested via the ignore_download_errors | ||
| # parameter, otherwise simply report the error and continue processing the goal_state; later on, | ||
| # before extensions are processed, the Agent checks whether the required certificates are already on disk and refreshes the goal | ||
| # state if they are not. | ||
| # state if they are not |
There was a problem hiding this comment.
won't fix; use of a boolean should be easy to understand by future maintaners
|
|
||
| xml_doc = parse_doc(xml_text) | ||
| data = findtext(xml_doc, "Data") | ||
| if data is None: |
There was a problem hiding this comment.
Why are download failures treated differently than missing data element?
Right now we continue in the case of missing data, but raise in the case of download/decryption errors (when ignore_download_error is False).
There was a problem hiding this comment.
They would be different: a download error vs a successful request with no data. The way the code looks, it seems that it is (or it was) possible to get a response without any data. I've never seen such a response.
As part of #3484, the waagent provision handler no longer pulls the entire goal state. As a side effect, when SSH keys are not available in ovfenv.xml, the fallback to the WireServer's Certificates package is no longer available. This PR adds that fallback.